11 research outputs found

    Self-supervised end-to-end ASR for low resource L2 Swedish

    Unlike traditional (hybrid) Automatic Speech Recognition (ASR), end-to-end ASR systems simplify the training procedure by directly mapping acoustic features to sequences of graphemes or characters, thereby eliminating the need for specialized acoustic, language, or pronunciation models. However, one drawback of end-to-end ASR systems is that they require more training data than conventional ASR systems to achieve a similar word error rate (WER). This makes it difficult to develop ASR systems for tasks where transcribed target data is limited, such as ASR for Second Language (L2) speakers of Swedish. Nonetheless, recent advancements in self-supervised acoustic learning, manifested in wav2vec models [1, 2, 3], leverage the available untranscribed speech data to provide compact acoustic representations that can achieve a low WER when incorporated in end-to-end systems. To this end, we experiment with several monolingual and cross-lingual self-supervised acoustic models to develop an end-to-end ASR system for L2 Swedish. Even though our test set is very small, it indicates that these systems are competitive in performance with a traditional ASR pipeline. Our best model seems to reduce the WER by 7% relative to our traditional ASR baseline trained on the same target data.
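
    To illustrate the "directly mapping acoustic features to sequences of graphemes" idea described above, here is a minimal inference sketch using the Hugging Face transformers API. It is not the paper's system; the checkpoint name is only an example of a publicly available Swedish wav2vec2 CTC model, and the silent waveform stands in for a real 16 kHz recording of L2 speech.

```python
# Minimal sketch, not the paper's system: a self-supervised encoder fine-tuned
# with a character-level CTC head maps raw audio directly to text.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

ckpt = "KBLab/wav2vec2-large-voxrex-swedish"   # example public checkpoint (assumption)
processor = Wav2Vec2Processor.from_pretrained(ckpt)
model = Wav2Vec2ForCTC.from_pretrained(ckpt).eval()

# One second of silence stands in for a real 16 kHz recording of L2 Swedish.
waveform = torch.zeros(16000)
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocab) character scores
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids))          # greedy CTC decoding to a grapheme string
```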

    Yhtenäinen Vähäresurssinen Puheentunnistus Toisen Kielen Oppijoille (End-to-End Low-Resource Speech Recognition for Second Language Learners)

    Compared to native speech, second language (L2) learners' speech is more difficult for automatic speech recognition (ASR) systems to recognize, since it is much more likely to contain lexical and grammatical errors, as well as disfluencies and mispronunciations. Furthermore, L2 ASR is challenging because it is low-resource: the amount of training data is very limited. Unlike conventionally used Hidden Markov Model-based ASR systems, end-to-end ASR systems eliminate the need for separate components by directly mapping acoustic features to text. However, these systems require large amounts of labelled training data, which makes it difficult to apply them to L2 ASR. Recent advancements in self-supervised acoustic learning leverage widely available untranscribed speech data to learn powerful acoustic representations which can be incorporated in end-to-end systems. This work explores and deploys mono- and multilingual self-supervised acoustic models for low-resource L2 ASR. In this thesis, the ASR systems are developed for L2 speakers of Finland-Swedish, Finnish, and German. Depending on the target language, the self-supervised end-to-end models provide a relative improvement in word error rate of 31.3-45.1% compared to the conventional ASR systems. The results obtained in this thesis show the high performance and promising potential of self-supervised end-to-end acoustic models for low-resource L2 ASR. In addition, this work is an important step in the development of automatic speaking assessment tools for L2 speakers, in which an accurate ASR system is a crucial component.

    Speech recognition for second language learners is a challenging task because of grammatical and pronunciation errors and disfluent speech. In addition, L2 speech recognition is low-resource, since speech data from L2 learners is scarce. Hidden Markov Model-based recognizers usually require their components to be adapted for L2 speech recognition. End-to-end neural network models, on the other hand, remove the need for separate adapted modules by mapping acoustic features directly to text. These models are, however, difficult to apply to L2 speech recognition, because they require large amounts of transcribed training data. Recently developed self-supervised neural models can learn rich speech representations by exploiting the abundantly available untranscribed speech data, and these learned representations make it possible to train an end-to-end recognizer with a smaller amount of transcribed speech. This thesis investigates mono- and multilingual self-supervised neural models and their suitability for L2 speech recognition, and develops recognizers for learners of Finland-Swedish, Finnish, and German. Depending on the language, the self-supervised systems trained in this thesis reduce the word error rate by 31.3-45.1% relative to conventional models. The results show that self-supervised end-to-end models can be used effectively for low-resource L2 speech recognition. This work is also an important step towards automatic speaking assessment for L2 learners, in which an accurate speech recognizer is an essential component.
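
    The relative word error rate improvements quoted above are ratios of the baseline and end-to-end error rates. A minimal sketch of how such a comparison could be computed, assuming the jiwer package and hypothetical reference and hypothesis transcripts (none of the values below come from the thesis):

```python
# Illustrative sketch only: WER for two systems and the relative reduction,
# as in "a relative improvement in word error rate of 31.3-45.1%".
import jiwer

references    = ["hon läser en bok", "vi åker till skolan"]   # gold transcripts (hypothetical)
hyps_baseline = ["hon läser på bok", "vi åker till skola"]    # conventional system (hypothetical)
hyps_wav2vec  = ["hon läser en bok", "vi åker till skola"]    # end-to-end system (hypothetical)

wer_baseline = jiwer.wer(references, hyps_baseline)
wer_e2e = jiwer.wer(references, hyps_wav2vec)

# Relative reduction: the fraction of the baseline WER that the new system removes.
relative_reduction = 100.0 * (wer_baseline - wer_e2e) / wer_baseline
print(f"baseline WER {wer_baseline:.3f}, end-to-end WER {wer_e2e:.3f}, "
      f"relative reduction {relative_reduction:.1f}%")
```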

    Automated Writing Support for Swedish Learners

    This paper describes a tool developed for lexical and grammatical analysis of Swedish text and for providing automated feedback to language learners. The system looks for words and word sequences that are likely to contain errors and suggests how to correct them using different non-neural models. The feedback consists of alternative word and word sequence suggestions and morphological features that need to be corrected. Although the system is able to provide reasonable feedback which is believed to be useful for language learners, it still needs further improvements to address drawbacks such as low precision.

    Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering

    With the rapid advancement in automatic speech recognition and natural language understanding, a complementary field (paralinguistics) has emerged, focusing on the non-verbal content of speech. The ACM Multimedia 2022 Computational Paralinguistics Challenge introduced several exciting tasks in this field. In this work, we focus on tackling two Sub-Challenges using modern, pre-trained models called wav2vec2. Our experimental results demonstrate that wav2vec2 is an excellent tool for detecting the emotions behind vocalisations and recognising different types of stuttering. Although they achieve outstanding results on their own, our results demonstrate that wav2vec2-based systems can be further improved by ensembling them with other models. Our best systems outperformed the competition baselines by a considerable margin, achieving an unweighted average recall of 44.0 (an absolute improvement of 6.6% over the baseline) on the Vocalisation Sub-Challenge and 62.1 (an absolute improvement of 21.7% over the baseline) on the Stuttering Sub-Challenge.
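
    The two mechanisms mentioned above, late-fusion ensembling and unweighted average recall (UAR), are standard and easy to make concrete. The sketch below uses hypothetical class-posterior arrays, not outputs from the paper's systems; UAR is simply macro-averaged recall, so every class counts equally regardless of its frequency.

```python
# Sketch: ensemble two classifiers by averaging their class posteriors,
# then score with unweighted average recall (UAR). All data is hypothetical.
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 1, 2, 1, 0, 2])                     # gold labels (hypothetical)

# Posterior outputs of two systems on the same six utterances (hypothetical).
probs_wav2vec2 = np.array([[.7, .2, .1], [.2, .6, .2], [.1, .2, .7],
                           [.3, .4, .3], [.5, .3, .2], [.2, .3, .5]])
probs_other    = np.array([[.6, .3, .1], [.3, .5, .2], [.2, .2, .6],
                           [.2, .5, .3], [.4, .4, .2], [.3, .2, .5]])

# Simple late-fusion ensemble: average the posteriors, then take the argmax.
ensemble_pred = np.argmax((probs_wav2vec2 + probs_other) / 2, axis=1)

# UAR = macro-averaged recall over the classes.
uar = recall_score(y_true, ensemble_pred, average="macro")
print(f"UAR: {100 * uar:.1f}")
```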

    Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model

    Self-supervised speech models, such as wav2vec2, have become extremely popular in the past few years. Their main appeal is that after pre-training on a large amount of audio, they require only a small amount of supervised fine-tuning data to achieve outstanding results. Despite their immense success, very little is understood about the pre-trained models and how fine-tuning changes them. In this work, we take the first steps towards a better understanding of wav2vec2 systems using model interpretation tools such as visualization and latent embedding clustering. Through our analysis, we gain new insights into the abilities of the pre-trained networks and the effect that fine-tuning has on them. We demonstrate that the clusters learned by the pre-trained model are just as important a factor as the supervised training data distribution in determining the accuracy of the fine-tuned system, which could aid in selecting the most suitable pre-trained model for the supervised data.
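
    As an illustration of the latent embedding clustering mentioned above, the sketch below extracts frame-level context vectors from a pre-trained wav2vec2 encoder and groups them with k-means. The checkpoint name and the random input are placeholders, and the cluster count is arbitrary; this is not the paper's analysis pipeline.

```python
# Illustrative sketch: cluster frame-level context vectors of a pre-trained
# wav2vec2 encoder. Checkpoint and input are stand-ins, not the paper's setup.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()  # example checkpoint

waveform = torch.randn(1, 16000)                 # 1 s of noise instead of real speech
with torch.no_grad():
    outputs = model(waveform)                    # pre-trained encoder only, no fine-tuning
frames = outputs.last_hidden_state.squeeze(0)    # (frames, hidden_dim) context vectors

# Cluster the frame embeddings; in an analysis these clusters could be compared
# against phone labels, or against embeddings from a fine-tuned model.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(frames.numpy())
print(kmeans.labels_[:20])                       # cluster id of the first frames
```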

    wav2vec2-based Speech Rating System for Children with Speech Sound Disorder

    Speaking is a fundamental means of communication, developed at a young age. Unfortunately, some children with speech sound disorder struggle to acquire this skill, hindering their ability to communicate efficiently. Speech therapies, which could aid these children in speech acquisition, rely greatly on speech practice trials and accurate feedback about their pronunciation. To enable home therapy and lessen the burden on speech-language pathologists, we need a highly accurate and automatic way of assessing the quality of speech uttered by young children. Our work focuses on exploring the applicability of state-of-the-art self-supervised, deep acoustic models, mainly wav2vec2, for this task. The empirical results highlight that these self-supervised models are superior to traditional approaches and close the gap between machine and human performance.
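
    One common way to build a rater on top of a self-supervised encoder, sketched below, is to mean-pool the frame embeddings of each utterance and train a small classifier on the pooled vectors. This is only a hedged illustration of the general approach, not the paper's rating system; the checkpoint, waveforms, and ratings are all hypothetical.

```python
# Minimal sketch (not the paper's system): mean-pooled wav2vec2 embeddings
# feeding a small pronunciation-rating classifier. All data is hypothetical.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()  # example checkpoint

def utterance_embedding(waveform: torch.Tensor) -> np.ndarray:
    """Mean-pool the frame-level context vectors into one utterance vector."""
    with torch.no_grad():
        frames = encoder(waveform.unsqueeze(0)).last_hidden_state   # (1, T, D)
    return frames.mean(dim=1).squeeze(0).numpy()

# Hypothetical utterances with binary ratings (1 = acceptable pronunciation).
waveforms = [torch.randn(16000) for _ in range(8)]
ratings = np.array([1, 0, 1, 1, 0, 0, 1, 0])

X = np.stack([utterance_embedding(w) for w in waveforms])
rater = LogisticRegression(max_iter=1000).fit(X, ratings)
print(rater.predict(X[:2]))      # predicted ratings for the first two utterances
```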

    Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks

    In 2020-2021, the Donate Speech campaign gathered approximately 3600 hours of ordinary, colloquial Finnish speech for the Lahjoita puhetta (Donate Speech) corpus, which includes over twenty thousand speakers from all regions of Finland and all age brackets. The goal of the collection was to create a representative, large-scale resource of spontaneous spoken Finnish to accelerate the development of language technology and speech-based services.

    New data, benchmark and baseline for L2 speaking assessment for low-resource languages

    The development of large multilingual speech models makes it possible to construct high-quality speech technology even for low-resource languages. In this paper, we present the speech data of L2 learners of Finnish and Finland Swedish that we have recently collected for training and evaluation of automatic speech recognition (ASR) and automatic speaking assessment (ASA). It includes over 4000 recordings by over 300 students per language in short read-aloud and free-form tasks. The recordings have been manually transcribed and assessed for pronunciation, fluency, range, accuracy, task achievement, and a holistic proficiency level. We also present an ASR and ASA benchmarking setup constructed using this data and include results from our baseline systems, built by fine-tuning a self-supervised multilingual model for the target language. In addition to benchmarking, our baseline system can be used by L2 students and teachers for online self-training and evaluation of oral proficiency.
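
    Speaking-assessment benchmarks of this kind are often reported as the agreement between predicted and human proficiency scores, for example as a rank correlation. The sketch below shows that computation with scipy; both score lists are hypothetical and are not values from the paper's benchmark.

```python
# Hedged sketch: agreement between automatic and human proficiency scores,
# measured with Spearman rank correlation. All scores are hypothetical.
from scipy.stats import spearmanr

human_scores = [2, 3, 1, 4, 3, 2, 5, 4]   # holistic proficiency levels from human raters
model_scores = [2, 3, 2, 4, 2, 2, 5, 3]   # levels predicted by a baseline ASA system

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho {rho:.2f} (p = {p_value:.3f})")
```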

    Developing an AI-assisted Low-resource Spoken Language Learning App for Children

    Computer-assisted Language Learning (CALL) is a rapidly developing area accelerated by advancements in the field of AI. A well-designed and reliable CALL system allows students to practice language skills, like pronunciation, any time outside of the classroom. Furthermore, gamification via mobile applications has shown encouraging results on learning outcomes and motivates young users to practice more and to perceive language learning as a positive experience. In this work, we adapt the latest speech recognition technology to be part of an online pronunciation training system for small children. As part of our gamified mobile application, our models will assess the pronunciation quality of young Swedish children diagnosed with Speech Sound Disorder who are participating in speech therapy. Additionally, the models provide feedback to young non-native children learning to pronounce Swedish and Finnish words. Our experiments revealed that these new models fit into an online game, as they function as speech recognizers and pronunciation evaluators simultaneously. To make our systems more trustworthy and explainable, we investigated whether the combination of modern input attribution algorithms and time-aligned transcripts can explain the decisions made by the models, give us insights into how the models work, and provide a tool to develop more reliable solutions.
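
    To make the input-attribution idea above concrete, the sketch below applies Integrated Gradients (from the Captum library) to a wav2vec2 sequence classifier: the attribution scores over the waveform can then be pooled over the time spans of each word given by a forced alignment, yielding word-level explanations. The two-class head is randomly initialised and the audio is noise; this is an assumed, purely illustrative setup, not the paper's model or attribution method.

```python
# Hedged sketch: Integrated Gradients over the input waveform of a wav2vec2
# classifier. The checkpoint, head, and audio are illustrative stand-ins.
import torch
from captum.attr import IntegratedGradients
from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2).eval()   # example checkpoint, untrained head

def forward(input_values: torch.Tensor) -> torch.Tensor:
    return model(input_values).logits                # (batch, 2) class scores

waveform = torch.randn(1, 16000, requires_grad=True) # 1 s stand-in for a child's utterance
ig = IntegratedGradients(forward)
attributions = ig.attribute(waveform, target=1, n_steps=8)   # saliency per audio sample

# Pooling |attributions| over word boundaries from a time-aligned transcript
# would turn these sample-level scores into word-level explanations.
print(attributions.abs().mean().item())
```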